Introduction

This is the 4th project of Udacity’s Data Analyst NanoDegree Program. We were given several data sources as options to analyze from. I chose the Arizona’s 2016 Presidential Campaign Finance from the Federal Election Commission (FEC) website.

The format of this analysis will be as the following:

1- I will declare my intentions with a hypothesis (if applicable)

2- Insert a R snippet/code and run it

3- Declare my findings.

And so on…

First I will begin the analysis by exploring basic statistics about the data set. This will help me see the nature of the data, and whether the data needs cleaning or wrangling. Afterwards, I will explore variable and multivariate relationships, by using the methods I have learned in chapter 4, such as scatter, line, box plots and histograms. This is the basic outline of the analysis, but surely I will find interesting things to talk about along the way.

Data Preparation and Munging

# import data and packages
az <- read.csv('P-AZ1')
az$x <- NULL

library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
## 
##     collect, recode, rename
## The following object is masked from 'package:plyr':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
library(reshape2)
library(gender)
library(stringr)
require(magrittr)
## Loading required package: magrittr
require(zipcode)
## Loading required package: zipcode
require(tmap)
## Loading required package: tmap
require(glue)
## Loading required package: glue
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse
require(githubinstall)
## Loading required package: githubinstall
# Install choroplethrZip
#install.packages("devtools")
#library(devtools)
#install_github('arilamstein/choroplethrZip@v1.5.0')

x <- c("ggmap", "rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap")
lapply(x, library, character.only = TRUE) # load the required packages
## 
## Attaching package: 'ggmap'
## The following object is masked from 'package:magrittr':
## 
##     inset
## Loading required package: sp
## rgdal: version: 1.2-15, (SVN revision 691)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.0, released 2017/04/28
##  Path to GDAL shared files: C:/Users/alema/Documents/R/win-library/3.4/sf/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/alema/Documents/R/win-library/3.4/sf/proj
##  Linking to sp version: 1.2-5
## rgeos version: 0.3-26, (SVN revision 560)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
##  Linking to sp version: 1.2-5 
##  Polygon checking: TRUE
## Checking rgeos availability: TRUE
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:reshape2':
## 
##     smiths
## [[1]]
##  [1] "ggmap"         "githubinstall" "glue"          "tmap"         
##  [5] "zipcode"       "magrittr"      "stringr"       "gender"       
##  [9] "reshape2"      "memisc"        "MASS"          "lattice"      
## [13] "gridExtra"     "dplyr"         "plyr"          "ggplot2"      
## [17] "stats"         "graphics"      "grDevices"     "utils"        
## [21] "datasets"      "methods"       "base"         
## 
## [[2]]
##  [1] "rgdal"         "sp"            "ggmap"         "githubinstall"
##  [5] "glue"          "tmap"          "zipcode"       "magrittr"     
##  [9] "stringr"       "gender"        "reshape2"      "memisc"       
## [13] "MASS"          "lattice"       "gridExtra"     "dplyr"        
## [17] "plyr"          "ggplot2"       "stats"         "graphics"     
## [21] "grDevices"     "utils"         "datasets"      "methods"      
## [25] "base"         
## 
## [[3]]
##  [1] "rgeos"         "rgdal"         "sp"            "ggmap"        
##  [5] "githubinstall" "glue"          "tmap"          "zipcode"      
##  [9] "magrittr"      "stringr"       "gender"        "reshape2"     
## [13] "memisc"        "MASS"          "lattice"       "gridExtra"    
## [17] "dplyr"         "plyr"          "ggplot2"       "stats"        
## [21] "graphics"      "grDevices"     "utils"         "datasets"     
## [25] "methods"       "base"         
## 
## [[4]]
##  [1] "maptools"      "rgeos"         "rgdal"         "sp"           
##  [5] "ggmap"         "githubinstall" "glue"          "tmap"         
##  [9] "zipcode"       "magrittr"      "stringr"       "gender"       
## [13] "reshape2"      "memisc"        "MASS"          "lattice"      
## [17] "gridExtra"     "dplyr"         "plyr"          "ggplot2"      
## [21] "stats"         "graphics"      "grDevices"     "utils"        
## [25] "datasets"      "methods"       "base"         
## 
## [[5]]
##  [1] "maptools"      "rgeos"         "rgdal"         "sp"           
##  [5] "ggmap"         "githubinstall" "glue"          "tmap"         
##  [9] "zipcode"       "magrittr"      "stringr"       "gender"       
## [13] "reshape2"      "memisc"        "MASS"          "lattice"      
## [17] "gridExtra"     "dplyr"         "plyr"          "ggplot2"      
## [21] "stats"         "graphics"      "grDevices"     "utils"        
## [25] "datasets"      "methods"       "base"         
## 
## [[6]]
##  [1] "tidyr"         "maptools"      "rgeos"         "rgdal"        
##  [5] "sp"            "ggmap"         "githubinstall" "glue"         
##  [9] "tmap"          "zipcode"       "magrittr"      "stringr"      
## [13] "gender"        "reshape2"      "memisc"        "MASS"         
## [17] "lattice"       "gridExtra"     "dplyr"         "plyr"         
## [21] "ggplot2"       "stats"         "graphics"      "grDevices"    
## [25] "utils"         "datasets"      "methods"       "base"         
## 
## [[7]]
##  [1] "tidyr"         "maptools"      "rgeos"         "rgdal"        
##  [5] "sp"            "ggmap"         "githubinstall" "glue"         
##  [9] "tmap"          "zipcode"       "magrittr"      "stringr"      
## [13] "gender"        "reshape2"      "memisc"        "MASS"         
## [17] "lattice"       "gridExtra"     "dplyr"         "plyr"         
## [21] "ggplot2"       "stats"         "graphics"      "grDevices"    
## [25] "utils"         "datasets"      "methods"       "base"
# convert zip code to factor
az$contbr_zip <- factor(az$contbr_zip)
#convert from char to date class
az$proper_date <- as.Date(az$contb_receipt_dt, format = '%d-%B-%y')

The structure of the data is as the following The file has 19 variables, and these are the most important ones to for the analysis:

I wish if there was a party and a gender column, I will try to it below.

# Add party col
# Note code template was taken from Udacity Forums

index <- c("Johnson, Gary", "Stein, Jill", "McMullin, Evan")
dindex <- c("Clinton, Hillary Rodham", "Sanders, Bernard", "Lessig, Lawrence", "O'Malley, Martin Joseph", "Webb, James Henry Jr.")
rindex <- c('Bush, Jeb', "Carson, Benjamin S."
            , "Christie, Christopher J", "Cruz, Rafael Edward 'Ted'",
            "Fiorina, Carly", "Gilmore, James S III" ,
            "Graham, Lindsey O.", "Huckabee, Mike", 
            "Jindal, Bobby", "Kasich, John R.",
            "Paul, Rand", "Perry, James R. (Rick)",
            "Rubio, Marco", "Trump, Donald J.",
            "Walker, Scott" )
attach(az)
az$party[cand_nm %in% index] <- "independent"
az$party[cand_nm %in% dindex] <- "democrat"
az$party[cand_nm %in% rindex] <- 'republican'
detach(az)

# Convert party to factor
az$party <- factor(az$party)

I also would like to add other information such as latitudes and longitudes for map analysis

data(zipcode)
az_loc <- subset(zipcode, state == 'AZ')
az$clean_zip <- substring(az$contbr_zip, 1, 5)
az <- merge(az, az_loc, by.x = 'clean_zip', by.y = 'zip')

I would also like to integrate population data by zip-code from the 2010 ZCTA census.

# Add Gender col to candidates

# Gender indices
m_index <- c('Bush, Jeb', "Carson, Benjamin S."
            , "Christie, Christopher J", "Cruz, Rafael Edward 'Ted'",
            "Gilmore, James S III" ,
            "Graham, Lindsey O.", "Huckabee, Mike", 
            "Jindal, Bobby", "Kasich, John R.",
            "Paul, Rand", "Perry, James R. (Rick)",
            "Rubio, Marco", "Trump, Donald J.",
            "Walker, Scott", "Sanders, Bernard",
            "Lessig, Lawrence", "O'Malley, Martin Joseph",
            "Webb, James Henry Jr.", "Johnson, Gary",
            "McMullin, Evan"
            )

f_index <- c("Clinton, Hillary Rodham", "Fiorina, Carly", "Stein, Jill" )



#simple cand_gender
az$cand_gender <-NA
attach(az)
az$cand_gender[cand_nm %in% m_index] <- "Male"
az$cand_gender[cand_nm %in% f_index] <- "Female"
detach(az)
# convert cand_gender to factor
az$cand_gender <- factor(az$cand_gender)

Now that I added candidates’ genders, I’ll add the contributors’ genders, by using the gender package.

# contributers' genders



# Get first names in a seperate col
az$first_name <- str_split_fixed(az$contbr_nm, ", ", 2)[,2]

# Use gender function
gender_df <- gender(as.character(az$first_name), c(1932, 1998),
countries= "United States")

# Assign gender to contributers in az df
names(gender_df)[1] = "first_name"
names(gender_df)[4] = 'contrib_gender'
gender_df <- unique(gender_df)
az <- merge(az, gender_df[ c("first_name", "contrib_gender")])

# convert contrib_gender to factor
az$contrib_gender <- factor(az$contrib_gender)

Exploratory Data Analysis

# Create a function that gets  top ten counts
top10 <- function(x){
  y <-table(x)
  y <- sort(y, decreasing = T)
  y <- as.data.frame(y)
  y[1:10,]

  
  
}
  
  
  

# count city occurences in data set
city_tab <- top10(az$contbr_city)


ggplot(city_tab, aes(x, Freq))+
  geom_bar(stat='identity')

# top contribs counts
top_parties <-top10(az$party)

ggplot(top_parties, aes(x, Freq))+
  geom_bar(stat='identity')
## Warning: Removed 7 rows containing missing values (position_stack).

city_tab

I would like to know how the data is distributed.

#money disterbution

ggplot(az, aes(contb_receipt_amt))+
  geom_histogram(binwidth = 15)+
  coord_cartesian(xlim=c(0:500))

#money disterbution normalized


ggplot(az, aes(contb_receipt_amt))+
  geom_histogram()+
  scale_x_log10()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1513 rows containing non-finite values (stat_bin).

ggplot(az, aes(contb_receipt_amt))+
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Date distribution

#Date disterbution
ggplot(az, aes(proper_date))+
  geom_histogram(binwidth = 20)

It appears that we have negative numbers, that goes all to -5400. I believe that it represents refunds, since the most receipt comment is receipt.

I want to find out the amount stats without the refunds.

# most popular bill denomination/   
summary(subset(az$contb_receipt_amt, az$contb_receipt_amt > 0 ))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.04    15.00    27.00    80.21    61.86 10800.00
non_zero_rec <- subset(az$contb_receipt_amt, az$contb_receipt_amt > 0 )
tab <- table(non_zero_rec)
str(tab)
##  'table' int [1:2224(1d)] 1 1 2 2 1 25 1 861 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ non_zero_rec: chr [1:2224] "0.04" "0.12" "0.24" "0.5" ...
tab[tab>1000]
## non_zero_rec
##     3     5     8    10    15    19    20    25    27    28    35    38 
##  1126  7727  1869 11486  5699  2539  2638 17543  5432  2258  2322  1014 
##    40    50    75    80   100   200   250   500 
##  2340 13346  1325  2038 11547  1941  4095  1380
top_deno<- top10(az$contb_receipt_amt)

ggplot(top_deno, aes(x, Freq))+
  geom_bar(stat='identity')

I wonder why the odd numbers such as 19, 27 or even 38.

Below we will see the number of contributions for each candidate and how they break out:

# Most count of cont.
sort(summary(az$cand_nm), decreasing= T)
##   Clinton, Hillary Rodham          Sanders, Bernard 
##                     53861                     35784 
##          Trump, Donald J. Cruz, Rafael Edward 'Ted' 
##                     16087                      7129 
##       Carson, Benjamin S.              Rubio, Marco 
##                      2954                      1657 
##            Fiorina, Carly                Paul, Rand 
##                       462                       426 
##             Johnson, Gary           Kasich, John R. 
##                       318                       263 
##               Stein, Jill                 Bush, Jeb 
##                       199                       122 
##            Huckabee, Mike            McMullin, Evan 
##                       101                        98 
##             Walker, Scott   O'Malley, Martin Joseph 
##                        95                        29 
##  Christie, Christopher J.      Santorum, Richard J. 
##                        19                        19 
##             Jindal, Bobby        Graham, Lindsey O. 
##                        10                         9 
##     Webb, James Henry Jr.          Lessig, Lawrence 
##                         5                         4 
##    Perry, James R. (Rick)      Gilmore, James S III 
##                         1                         0
# top  cand in terms of money
cand_groups <- group_by(az, cand_nm) 

cand_sum <-summarize(cand_groups, 
                      mean(contb_receipt_amt),
                      n = n())


tot_rec <- aggregate(az$contb_receipt_amt, list(az$cand_nm), sum)


tot_rec <- arrange(tot_rec, desc(x))

tot_rec
# Top in money recieved histogram

ggplot(data= tot_rec,
       aes(reorder(Group.1, -x), x)) +
  geom_bar(stat="identity")

# Normalized
ggplot(data= tot_rec,
       aes(reorder(Group.1, -x), x)) +
  geom_bar(stat="identity")+
  scale_y_log10()

I would like to see the box-plot of each gender/party contribution.

ggplot(az[!is.na(az$party) & az$party != 'independent',], aes(x=contrib_gender, y=contb_receipt_amt))+
  geom_boxplot(varwidth=T)+
  facet_grid(~party)+
  coord_cartesian(ylim=boxplot.stats(az$contb_receipt_amt)$stats[c(1, 5)]
)

Republicans contributed more on average, and they had a higher range of contribution amounts. Male republicans contributed slightly more on average than their female counterparts.

table(az$contrib_gender)
## 
## female   male 
##  65569  54083

I did not expect to find more female contributors than males in this data-set.

Lets explore if females were more likely to vote for female candidates.

same_sexf<- subset(az, az$contrib_gender == 'female' & az$cand_gender== 'Female')

length(same_sexf$contrib_gender)/length(az$contrib_gender[az$contrib_gender == 'female'])
## [1] 0.546188
same_sexm<- subset(az, az$contrib_gender == 'male' & az$cand_gender== 'Male')

length(same_sexm$contrib_gender)/ length(az$contrib_gender[az$contrib_gender == 'male'])
## [1] 0.653514
ggplot(az[!is.na(az$contrib_gender) & !is.na(az$cand_gender),] , aes(cand_gender, contrib_gender))+
  geom_bin2d()

%54.5 females of this data-set contributed to females, while %65.4 of males contributed to males, which is a negligible preference.

# Create the mode function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# get distinct empoyee contrib from each employer

employer<- summary(az$contbr_employer)

                                  
# most genrous occupations and account of total 
                                  

#subset academia employees                               
academia <- subset(az, grepl('UNIVERSITY|COLLEGE', az$contbr_employer, ignore.case = T))


# group by university/college
cand_acad <- group_by(academia, contbr_employer)

# how much is averagly contributed by academia
spend_acad <- summarise(cand_acad,
                        number_of_contributions = n(),
                        avg_spent = mean(contb_receipt_amt),
                        main_cand = getmode(cand_nm),
                        party = getmode(party)
                        
                        )

# sort by num of contribs
attach(spend_acad)
spend_acad <- spend_acad[order(-number_of_contributions),]
detach(spend_acad)
spend_acad
# percentage of demo/repub colleges
percent(spend_acad$party)
##    democrat independent  republican           N 
##   80.341880    1.709402   17.948718  117.000000
# Percentage of candidate preferance
sort(percent(academia$cand_nm)[1:24],decreasing = T)
##   Clinton, Hillary Rodham          Sanders, Bernard 
##               59.54716981               36.50314465 
## Cruz, Rafael Edward 'Ted'          Trump, Donald J. 
##                1.43396226                1.40880503 
##       Carson, Benjamin S.              Rubio, Marco 
##                0.42767296                0.20125786 
##               Stein, Jill                Paul, Rand 
##                0.20125786                0.15094340 
##        Graham, Lindsey O.            Fiorina, Carly 
##                0.05031447                0.02515723 
##             Johnson, Gary           Kasich, John R. 
##                0.02515723                0.02515723 
##                 Bush, Jeb  Christie, Christopher J. 
##                0.00000000                0.00000000 
##      Gilmore, James S III            Huckabee, Mike 
##                0.00000000                0.00000000 
##             Jindal, Bobby          Lessig, Lawrence 
##                0.00000000                0.00000000 
##            McMullin, Evan   O'Malley, Martin Joseph 
##                0.00000000                0.00000000 
##    Perry, James R. (Rick)      Santorum, Richard J. 
##                0.00000000                0.00000000 
##             Walker, Scott     Webb, James Henry Jr. 
##                0.00000000                0.00000000
# 

Around %80 of colleges had a democratic preference. The majority of %59 of contributions were for Clinton, Sanders cones in second of %37. Cruz came in third (%1.42) and Trump close fourth (%1.4).

Below I will find the stats of homemakers and retirees

# Homemaker stats
homemaker <- subset(az, grepl('HOMEMAKER', az$contbr_occupation, ignore.case = T ))
summary(homemaker)
##   first_name         clean_zip              cmte_id         cand_id   
##  Length:1076        Length:1076        C00575795:600   P00003392:600  
##  Class :character   Class :character   C00574624:146   P60006111:146  
##  Mode  :character   Mode  :character   C00577130:105   P60007168:105  
##                                        C00573519: 99   P60005915: 99  
##                                        C00580100: 84   P80001571: 84  
##                                        C00458844: 14   P60006723: 14  
##                                        (Other)  : 28   (Other)  : 28  
##                       cand_nm                         contbr_nm  
##  Clinton, Hillary Rodham  :600   BORCH, INGER              : 39  
##  Cruz, Rafael Edward 'Ted':146   GUIDARELLI-AMBRAD, DEBORAH: 35  
##  Sanders, Bernard         :105   FRANK, GLORIA             : 29  
##  Carson, Benjamin S.      : 99   FRANZ, ROBIN              : 29  
##  Trump, Donald J.         : 84   DOVER, RITA               : 28  
##  Rubio, Marco             : 14   BADE, KRISTI              : 27  
##  (Other)                  : 28   (Other)                   :889  
##           contbr_city  contbr_st     contbr_zip      contbr_employer
##  SCOTTSDALE     :212   AZ:1076   857507118: 39   N/A         :531   
##  TUCSON         :171             852533610: 35   HOMEMAKER   :306   
##  PHOENIX        :143             852043820: 29   RETIRED     : 59   
##  GILBERT        : 85             852951792: 29   NONE        : 41   
##  MESA           : 83             853021415: 28   NOT EMPLOYED: 39   
##  PARADISE VALLEY: 57             852543072: 27   MY CHILDREN : 25   
##  (Other)        :325             (Other)  :889   (Other)     : 75   
##                       contbr_occupation contb_receipt_amt
##  HOMEMAKER                     :1028    Min.   : -40.0   
##  UNEMPLOYED - HOMEMAKER        :  25    1st Qu.:  25.0   
##  HOMEMAKER / PHOTOGRAPHER / MSW:   5    Median :  50.0   
##  HOMEMAKER/ACTIVIST/ARTIST     :   5    Mean   : 137.1   
##  HUSBAND/MECHANICWIFE/HOMEMAKER:   5    3rd Qu.: 100.0   
##  HOMEMAKER/PHYSICIAN           :   3    Max.   :2700.0   
##  (Other)                       :   5                     
##   contb_receipt_dt
##  19-OCT-16:  14   
##  03-NOV-16:  12   
##  06-NOV-16:  12   
##  09-OCT-16:  12   
##  26-SEP-16:  12   
##  04-NOV-16:  11   
##  (Other)  :1003   
##                                                            receipt_desc 
##                                                                  :1076  
##  * EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING:   0  
##  * REATTRIBUTED FROM EDWARD FARMILANT                            :   0  
##  * REATTRIBUTED TO BARBARA FAMILANT                              :   0  
##  * REATTRIBUTED TO VICTORIA STRONG                               :   0  
##  EVENT PLANNING REATTRIBUTION FROM SPOUSE                        :   0  
##  (Other)                                                         :   0  
##  memo_cd                               memo_text    form_tp   
##   :906                                      :872   SA17A:912  
##  X:170   * EARMARKED CONTRIBUTION: SEE BELOW: 99   SA18 :164  
##          * HILLARY VICTORY FUND             : 98   SB28A:  0  
##          *BEST EFFORTS UPDATE               :  5              
##          *                                  :  1              
##          EARMARKED FROM MAKE DC LISTEN      :  1              
##          (Other)                            :  0              
##     file_num                       tran_id     election_tp
##  Min.   :1014598   C5628470            :   2        :  4  
##  1st Qu.:1077853   A105C04C73FFA4C859DB:   1   G2016:452  
##  Median :1109498   A6BF5A3EFECE4468B9E9:   1   O2016:  1  
##  Mean   :1103419   A85C4E16099CC4E5F8A1:   1   P2016:619  
##  3rd Qu.:1133930   AAA1CD0DBF8AB4B9281D:   1   P2020:  0  
##  Max.   :1146165   AFCCA0974E8D949428D0:   1              
##                    (Other)             :1069              
##   proper_date                 party         city          
##  Min.   :2015-04-01   democrat   :705   Length:1076       
##  1st Qu.:2016-02-27   independent: 14   Class :character  
##  Median :2016-06-21   republican :357   Mode  :character  
##  Mean   :2016-05-23                                       
##  3rd Qu.:2016-09-21                                       
##  Max.   :2016-12-02                                       
##                                                           
##     state              latitude       longitude      cand_gender 
##  Length:1076        Min.   :31.49   Min.   :-114.6   Female:602  
##  Class :character   1st Qu.:33.30   1st Qu.:-112.1   Male  :474  
##  Mode  :character   Median :33.49   Median :-111.9               
##                     Mean   :33.37   Mean   :-111.8               
##                     3rd Qu.:33.62   3rd Qu.:-111.7               
##                     Max.   :36.62   Max.   :-109.4               
##                                                                  
##  contrib_gender
##  female:1035   
##  male  :  41   
##                
##                
##                
##                
## 
# Percentage of homemaker genders
percent(homemaker$contrib_gender)
##      female        male           N 
##   96.189591    3.810409 1076.000000
# Homemaker contribs party leaning
percent(homemaker$party)
##    democrat independent  republican           N 
##   65.520446    1.301115   33.178439 1076.000000
# retired findings
retired <- subset(az, grepl('RETIRED', az$contbr_occupation, ignore.case = T))

#percentage of retirees in the data
print(
length(unique(retired$contbr_nm))/
  length(unique(az$contbr_nm)) * 100)
## [1] 35.57178
# Retired party leaning percentage
print (summary(retired$party)/ length(retired$party) *100)
##    democrat independent  republican        NA's 
## 60.00657639  0.44540101 39.52410845  0.02391415
# Retrired avg spending
mean(retired$contb_receipt_amt)
## [1] 81.05961

Homemakers are %96 females, and %65 of homemakers are democrats.

As we can see above, retirees make up about %35.6 of the data-set. Around %60 of retirees contributed to democrats and around %40 percent to republicans, contributions to independents are negligible. Retirees contributed $81 on average.

I want to know which occupations are most politically active, and how do they lean politically.

jobs <-data.matrix(summary(az$contbr_occupation))

head(jobs,10)
##                        [,1]
## RETIRED               32749
## NOT EMPLOYED          13737
## INFORMATION REQUESTED  3214
## ATTORNEY               2107
## PHYSICIAN              1897
## TEACHER                1821
## ENGINEER               1389
## CONSULTANT             1272
## PROFESSOR              1239
## SALES                  1238

The most politically active occupations in the data set are attorneys, physicians then teachers.

Below I would like to know the proportions of party leaning for each job. For example, of all engineers how many percent of them lean republican (number of republican engineers/ total number of engineers).

require(GGally)
## Loading required package: GGally
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
require(gmodels)
## Loading required package: gmodels
# Create a DF for jobs and their party count
job_party_table<- as.data.frame.matrix(table(az$contbr_occupation, az$party))

# Create a DF with jobs and thier party percentages
job_party_dist<- prop.table(table(az$contbr_occupation, az$party), 1)

job_party_dist <- as.data.frame.table(job_party_dist)

# Merge the two DFs above
job_party_props<- data.frame(c(job_party_table, job_party_dist))

#Keep these cols
keep <- c('Var1','democrat', 'republican', 'Var2', 'Freq')
job_party_props <- job_party_props[,keep]

#change props to percentages
job_party_props$Freq <- job_party_props$Freq *100
# add a total_job col
job_party_props$total <- job_party_props$democrat + job_party_props$republican

#drop these cols
job_party_props$democrat <- NULL
job_party_props$republican <- NULL

# change Var2 col to party
names(job_party_props)[names(job_party_props) == 'Var2'] <- 'party'


# subset top 10 republican jobs
top10rep <-subset(job_party_props, party == 'republican' & total > 1000)

# create a pie chart
pie(top10rep$Freq, labels = top10rep$Var1)

#create a bar plot
ggplot(top10rep,
       aes(Var1, Freq))+
  geom_bar(stat = "identity")

# subset top 10 democrats jobs
top10demo <-subset(job_party_props, party == 'democrat' & total > 1000 & quantile(job_party_props$Freq, c(.660), na.rm = T))

# create a bar plot
ggplot(top10demo,
       aes(Var1, Freq))+
  geom_bar(stat = "identity")

# subset top 10 independents jobs
top10inde <-subset(job_party_props, party == 'independent' & Freq > quantile(Freq, probs = .62, na.rm = T) & total >100 & Var1 != 'N/A')

# create a bar plot
ggplot(top10inde,
       aes(Var1, Freq))+
  geom_bar(stat = "identity")

I would like to see average spending along dates

#group by avg week
require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:memisc':
## 
##     is.interval
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
by_wk_sum <- tapply(az$contb_receipt_amt, week(az$proper_date), sum)

by_wk_avg <- tapply(az$contb_receipt_amt, week(az$proper_date), mean)


plot(by_wk_avg, type = 'l')

plot(by_wk_sum)

# note the weeks are aggregated by all years
az$week <-format(az$proper_date, format = "%W")
az$month <-format(az$proper_date, format = "%m")
az$year <-format(az$proper_date, format = "%y")

by_wk <- az %>% group_by(year = as.numeric(year), week= as.numeric(week), party) %>% summarise(sum = sum(contb_receipt_amt), 
                                                                                  avg = mean(contb_receipt_amt),
                                                                                  n=n())



ggplot(subset(by_wk, avg> 0 & year > 14),
       aes(as.numeric(week), avg, color=party, size=n))+ geom_line()+
  facet_grid(~year)

It seems that the avg amount of contributions are huge at the beginning of 2015, but when I added a 4th variable (n = number of contributions) it shows that these were a few outliers, the mass of the contributions came in mid 2016 as it lowered the average but the size (n) was bigger substantially.

library(choroplethrZip)
library(choroplethr)
## Loading required package: acs
## Loading required package: XML
## 
## Attaching package: 'acs'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:base':
## 
##     apply
require(glue)
data(df_zip_demographics)
data("zip.regions")

zip.regions
az_demographics <- subset(zip.regions, region > 85001 & region < 86556) 
# create a function that selects zipcode and region for zip_choropleth analysis

clean <- function(x, y){
  region <- x
  value <- y
  new <- data.frame(region, value)
 new <- new[!duplicated(new$region), ]
}




# group by other intersting variables.
zip_avg_sum <- az %>%
  group_by(clean_zip, party) %>%
  summarise(
            average_contrib = mean(contb_receipt_amt),
            sum_of_contribs = sum(contb_receipt_amt))


# Democratic contrib heatmap
demo_zip_avg_sum <- na.omit(zip_avg_sum[zip_avg_sum$party == 'democrat',])

demo_zip_avg_sum <- clean(demo_zip_avg_sum$clean_zip, demo_zip_avg_sum$sum_of_contribs)
demo_heat <- zip_choropleth(demo_zip_avg_sum,state_zoom = 'arizona', county_zoom = 4013 )+
  ggtitle('Maricopa County Total USD Contribution (by Zipcode)', subtitle = 'Democratic Party Candidates')+
  labs(color= 'USD Amount')
## Warning in super$initialize(zip.map, user.df): Your data.frame contains
## the following regions which are not mappable: 85001, 85002, 85005, 85010,
## 85011, 85060, 85061, 85063, 85064, 85066, 85067, 85068, 85069, 85070,
## 85071, 85078, 85080, 85082, 85117, 85130, 85178, 85211, 85244, 85246,
## 85252, 85261, 85267, 85269, 85271, 85274, 85275, 85277, 85280, 85285,
## 85299, 85312, 85318, 85327, 85358, 85366, 85372, 85376, 85378, 85385,
## 85502, 85532, 85547, 85628, 85636, 85652, 85702, 85717, 85721, 85728,
## 85731, 85732, 85733, 85734, 85738, 85740, 85751, 85752, 85754, 85902,
## 86002, 86302, 86304, 86312, 86339, 86340, 86341, 86402, 86405, 86412,
## 86427, 86430, 86439
## Warning: Column `region` joining character vector and factor, coercing into
## character vector
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 85343, 85337, 85309, 85322
# Republican contrib heatmap 
repub_zip_avg_sum <- na.omit(zip_avg_sum[zip_avg_sum$party == 'republican',])
repub_zip_avg_sum <- clean(repub_zip_avg_sum$clean_zip, repub_zip_avg_sum$sum_of_contribs)
repub_heat <- zip_choropleth(repub_zip_avg_sum,state_zoom = 'arizona', county_zoom = 4013)+
  ggtitle(' ',subtitle =  'Republican Party Candidates')
## Warning in super$initialize(zip.map, user.df): Your data.frame contains
## the following regions which are not mappable: 85001, 85002, 85005, 85010,
## 85011, 85030, 85046, 85060, 85063, 85064, 85066, 85067, 85068, 85069,
## 85070, 85071, 85076, 85080, 85082, 85117, 85127, 85130, 85178, 85191,
## 85211, 85214, 85216, 85227, 85230, 85236, 85244, 85246, 85252, 85261,
## 85267, 85269, 85271, 85274, 85275, 85277, 85280, 85285, 85287, 85299,
## 85311, 85312, 85318, 85327, 85358, 85359, 85366, 85369, 85372, 85376,
## 85378, 85380, 85385, 85502, 85532, 85547, 85548, 85628, 85636, 85652,
## 85702, 85703, 85717, 85728, 85731, 85732, 85733, 85734, 85740, 85751,
## 85752, 85754, 85902, 86002, 86302, 86304, 86312, 86339, 86340, 86341,
## 86342, 86402, 86405, 86427, 86430, 86439, 86446
## Warning: Column `region` joining character vector and factor, coercing into
## character vector
## Warning in self$bind(): The following regions were missing and are being
## set to NA: 85256, 85337, 85309, 85322, 85333
# why I can't change the color to red.


# problem: heatmap scales are not matching, 
grid.arrange(demo_heat, repub_heat)

 # Reasearch heatmap color and scale
# Final plot 3
require(choroplethrMaps)
## Loading required package: choroplethrMaps
df_zip_demographics <- subset(df_zip_demographics, region > 85001 & region < 86556) 

ALLdemographics_df<- merge.data.frame(df_zip_demographics, zip.regions, by.x='region', by.y='region')

ALLdemographics_df
az_county<- merge.data.frame(az,ALLdemographics_df , by.x='clean_zip', by.y = 'region')

az_county
az_county_summ <- az_county %>%
group_by(county.name, county.fips.numeric, party) %>%
  summarise(n= n(),
             average_contrib = mean(contb_receipt_amt),
            sum_of_contribs = sum(contb_receipt_amt))



# Clean  avg_contribution
az_countyP1 <- with(az_county_summ, clean(county.fips.numeric, average_contrib))
az_countyP1 <- na.omit(az_countyP1)

# Plot avg_contribution by county
county_choropleth(az_countyP1, state_zoom = 'arizona')

# group by county & party
az_county_party <- merge(x = az_county, y = az[, c("party","tran_id" )], by = "tran_id", all.x=TRUE)

az_county_party_grouped <- az_county_party %>%
  group_by(county.fips.numeric, party.x) %>%
  summarize(average = mean(contb_receipt_amt))

# Plot demo avg_contribution by county

az_countyP1Demo <- with(az_county_party_grouped[az_county_party_grouped$party.x == 'democrat',] , clean(county.fips.numeric, average))

P1Demo <-county_choropleth(na.omit(az_countyP1Demo), state_zoom = 'arizona')+
    ggtitle(label = 'Democrate Average Contribution', subtitle = 'By County')







# Plot repub avg_contribution by county

az_countyP1Repub <- with(az_county_party_grouped[az_county_party_grouped$party.x == 'republican',] , clean(county.fips.numeric, average))

P1Repub <- county_choropleth(na.omit(az_countyP1Repub), state_zoom = 'arizona')+
  ggtitle(label = 'Republican Average Contribution', subtitle = 'By County')



# Arrange the party plots 
grid.arrange(P1Demo, P1Repub)

Note: there is discrepancy in the color scale:

Brain storming: What can I do to improve?

What kind of graphs could I add? -Bar chart of most contributing jobs to the Donald

I am wondering what kind jobs contributed to Donald trump, my intuition says it’s mostly blue collar jobs. Let’s find out!

#-Bar chart of most contributing jobs to the donald

Trump_jobs<- az[az$cand_nm == "Trump, Donald J.",] %>%
group_by(job=contbr_occupation) %>%
  summarise(avg = mean(contb_receipt_amt),
            n = n())

ggplot(Trump_jobs[Trump_jobs$n >100,], aes(x=job, y=avg))+
geom_bar(stat = 'identity')

#-Bar chart of most contributing jobs to Clinton
Clinton_jobs<- az[az$cand_nm == "Clinton, Hillary Rodham",] %>%
group_by(job=contbr_occupation) %>%
  summarise(avg = mean(contb_receipt_amt),
            n = n())

ggplot(Clinton_jobs[Clinton_jobs$n >390 & Clinton_jobs$avg > 0 ,], aes(x=job, y=avg))+
geom_bar(stat = 'identity')

unique(az$cand_nm)
##  [1] Trump, Donald J.          Sanders, Bernard         
##  [3] Cruz, Rafael Edward 'Ted' Clinton, Hillary Rodham  
##  [5] Stein, Jill               Carson, Benjamin S.      
##  [7] Paul, Rand                Fiorina, Carly           
##  [9] Rubio, Marco              Johnson, Gary            
## [11] Bush, Jeb                 Kasich, John R.          
## [13] Santorum, Richard J.      McMullin, Evan           
## [15] Webb, James Henry Jr.     Huckabee, Mike           
## [17] Walker, Scott             Christie, Christopher J. 
## [19] Jindal, Bobby             O'Malley, Martin Joseph  
## [21] Lessig, Lawrence          Graham, Lindsey O.       
## [23] Perry, James R. (Rick)   
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.

My hypothesis is false, most of trumps contributors have white collar jobs, even the higher income types such as engineers, consultants, physicians and CEOs. One weakness of this plot, it does not represent low income contributors.

let me see by number of contributions only if it helps me find out more,

ggplot(Trump_jobs[Trump_jobs$n >100 & Trump_jobs$n < 2000,], aes(x=job, y=n))+
geom_bar(stat = 'identity')

sum(Trump_jobs$n)/sum(cand_sum$n)
## [1] 0.1344482
ggplot(Trump_jobs[Trump_jobs$n >20 & Trump_jobs$n < 100,], aes(x=job, y=n))+
geom_bar(stat = 'identity')+
   theme(axis.text.x=element_text(angle=90,hjust=1))

The percentage of contributions for trump of the whole data-set is 18%.

By changing some of the subset filters, still the majority were high income occupations, even though we have some blue collar jobs such as truck driver and construction, but they were the minority. My hypothesis is blue-collar workers cannot afford to contribute therefore, they are underrepresented in this data-set.

I want to see if higher income zip-codes had more contributions and I will use a scatter-plot to demonstrate.

# final plots 2
# merge demographic data with original dataset
AZ_ALLdemographics_df <- merge.data.frame(az, ALLdemographics_df, by.x = 'clean_zip', by.y = 'region')
income_cotrib <- AZ_ALLdemographics_df %>% group_by(clean_zip, income= per_capita_income)%>% summarise(n = n(), total= sum(contb_receipt_amt),
                                                                                                       average = mean(contb_receipt_amt))



ggplot(income_cotrib[income_cotrib$n >100,], aes(income, total))+
  geom_point()+
  geom_smooth()+
  xlab('Median Income per Zipcode')+
  ylab('Total Population per Zipcoe')+
  ggtitle('Relationship Between Number of Median Income and Population per Zipcode')
## `geom_smooth()` using method = 'loess'

ggplot(income_cotrib[income_cotrib$n >100,], aes(income, average))+
  geom_point()+
  geom_smooth()+
  xlab('Median Income per Zipcode')+
  ylab('Average Contibution per Zipcode')+
  ggtitle('Relationship Between Average USD Amount of Contributions and Median Income', subtitle = 'Per Zipcode')
## `geom_smooth()` using method = 'loess'

There is only a strong relationship When I subsetted the data to 100 contributions at least per zip-code. Doing otherwise will skew the data and the relationship will not be apparent.

This is obvious but still I would like to see the relationship between number of contributions and population of zip-code.

#final plot 1

AZ_ALLdemographics_df
pop_contrib <- AZ_ALLdemographics_df %>%
  group_by(clean_zip, total_population)%>%
  summarize(n=n())

ggplot(pop_contrib, aes(n,total_population))+
  geom_point()+
  geom_smooth()+
  xlim(0,1000)+
  xlab('Number of Contributions')+
  ylab('Total Population')+
  ggtitle('Relationship Between Number of Contributions and Population of per Zipcode')
## `geom_smooth()` using method = 'loess'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

There is a strong correlation at first, but then as population increases the relationship weakens.

Final Plots:

finalp1<- ggplot(subset(na.omit(by_wk), avg> 0 & year > 14),
       aes(as.numeric(week), avg, color=party, size=n))+ geom_line()+
  facet_grid(~year)+
  ggtitle('Average USD Contributed Along 2015/16 Weeks by Party')+
  xlab('Week #')+
  ylab('USD Contributed (Average)')+
  labs(color= 'Party', size = 'Number of Contributions')
  
finalp1

finalp2 <-  ggplot(pop_contrib, aes(n,total_population))+
  geom_point()+
  geom_smooth()+
  xlim(0,1000)+
  xlab('Number of Contributions')+
  ylab('Total Population')+
  ggtitle('Relationship Between Number of Contributions and Population of per Zipcode')

finalp2
## `geom_smooth()` using method = 'loess'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

finalp3 <- ggplot(income_cotrib[income_cotrib$n >100,], aes(income, total))+
  geom_point()+
  geom_smooth()+
  xlab('Median Income per Zipcode')+
  ylab('Total Population per Zipcoe')+
  ggtitle('Relationship Between Number of Median Income and Population per Zipcode')

finalp3
## `geom_smooth()` using method = 'loess'

finalp4 <- ggplot(income_cotrib[income_cotrib$n >100,], aes(income, average))+
  geom_point()+
  geom_smooth()+
  xlab('Median Income per Zipcode')+
  ylab('Average Contibution per Zipcode')+
  ggtitle('Relationship Between Average USD Amount of Contributions and Median Income', subtitle = 'Per Zipcode')

finalp4
## `geom_smooth()` using method = 'loess'

finalp5 <-  county_choropleth(az_countyP1, state_zoom = 'arizona')



finalp5 

finalp6 <-  grid.arrange(P1Demo, P1Repub)

finalp6
## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
finalp7 <- grid.arrange(demo_heat, repub_heat)

Reflection:

overall this project was a good challenge and learning experience. At first it was easy and enjoyable exploring the data, as I went deeper into the analysis it became harder to come up with relationships and conclusions about the data. I wanted my analysis to have a central theme/thesis, the fact of not drawing a certain conclusion made me feel frustrated.

I was impressed with the versatility of R, and its packages, I felt like it was more intuitive than python, maybe because I have a background with Alteryx. Although, R felt like it had less support on stackoverflow than python, but there’s support nonetheless, which aided me significantly throughout the project. I also used Datacamp for filling in the knowledge gaps and reinforcing the concepts learned in the Udacity curriculum. I have not utilized Udacity’s live help as much as the other projects, because I did not face problems with programming itself, rather than loss of ideas and direction of my analysis.

In terms of visualizations, R is fantastic for data exploration, although it is lacking the ability to export high resolution plots. I feel that Tableau is more suitable for findings/conclusive plots.